10 research outputs found

    Robust causal structure learning with some hidden variables

    We introduce a new method to estimate the Markov equivalence class of a directed acyclic graph (DAG) in the presence of hidden variables, in settings where the underlying DAG among the observed variables is sparse, and there are a few hidden variables that have a direct effect on many of the observed ones. Building on the so-called low-rank plus sparse framework, we suggest a two-stage approach which first removes the effect of the hidden variables, and then estimates the Markov equivalence class of the underlying DAG under the assumption that there are no remaining hidden variables. This approach is consistent in certain high-dimensional regimes and performs favourably when compared to the state of the art, both in terms of graphical structure recovery and total causal effect estimation.

    Graphical model selection for Gaussian conditional random fields in the presence of latent variables: theory and application to genetics

    The task of performing graphical model selection arises in many applications in science and engineering. The field of application of interest in this thesis relates to the needs of datasets that include genetic and multivariate phenotypic data. There are several factors that make this problem particularly challenging: some of the relevant variables might not be observed, high dimensionality might cause identifiability issues and, finally, it might be preferable to learn the model over a subset of the collection while conditioning on the rest of the variables, e.g. genetic variants. We suggest addressing these problems by learning a conditional Gaussian graphical model, while accounting for latent variables. Building on recent advances in this field, we decompose the parameters of a conditional Markov random field into the sum of a sparse and a low-rank matrix. We derive convergence bounds for this novel estimator, show that it is well-behaved in the high-dimensional regime and describe algorithms that can be used when the number of variables is in the thousands. Through simulations, we illustrate the conditions required for identifiability and show that this approach is consistent in a wider range of settings. In order to show the practical implications of our work, we apply our method to two real datasets and devise a metric that makes use of an independent source of information to assess the biological relevance of the estimates. In our first application, we use the proposed approach to model the levels of 39 metabolic traits conditional on hundreds of genetic variants, in two independent cohorts. We find our results to be better replicated across cohorts than the ones obtained with other methods. In our second application, we look at a high-dimensional gene expression dataset. We find that our method is capable of retrieving as many biologically relevant gene-gene interactions as other methods while retrieving fewer irrelevant interactions.

    Right singular vector projection graphs: fast high dimensional covariance matrix estimation under latent confounding

    We consider the problem of estimating a high dimensional p×p covariance matrix Σ, given n observations of confounded data with covariance Σ + ΓΓ^T, where Γ is an unknown p×q matrix of latent factor loadings. We propose a simple and scalable estimator based on the projection onto the right singular vectors of the observed data matrix, which we call right singular vector projection (RSVP). Our theoretical analysis of this method reveals that, in contrast with approaches based on the removal of principal components, RSVP can cope well with settings where the smallest eigenvalue of Γ^TΓ is relatively close to the largest eigenvalue of Σ, as well as when the eigenvalues of Γ^TΓ are diverging fast. RSVP does not require knowledge or estimation of the number of latent factors q, but it recovers Σ only up to an unknown positive scale factor. We argue that this suffices in many applications, e.g. if an estimate of the correlation matrix is desired. We also show that, by using subsampling, we can further improve the performance of the method. We demonstrate the favourable performance of RSVP through simulation experiments and an analysis of gene expression data sets collated by the GTEx consortium.
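
The projection step named in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says the estimator is based on projecting onto the right singular vectors of the observed data matrix. The function name `rsvp_sketch`, the column centring, and the numerical rank threshold are assumptions made here for illustration, and the subsampling refinement is omitted.

```python
import numpy as np


def rsvp_sketch(X, tol=1e-10):
    """Projection matrix onto the right singular vectors of X (n x p).

    Hypothetical helper: centring and the rank threshold `tol` are
    assumptions of this sketch, not details given in the abstract.
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)                 # centre each column (assumption)
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > tol * s.max()               # drop numerically zero directions
    V = Vt[keep].T                         # p x r matrix of right singular vectors
    return V @ V.T                         # p x p projection onto the row space of X


# Toy data: n = 20 observations of p = 50 variables with q = 2 latent factors.
rng = np.random.default_rng(0)
n, p, q = 20, 50, 2
Gamma = rng.normal(size=(p, q))            # hypothetical latent factor loadings
X = rng.normal(size=(n, p)) + rng.normal(size=(n, q)) @ Gamma.T
P = rsvp_sketch(X)
```

As the abstract notes, an estimate of this kind is determined only up to a positive scale factor, so in this sketch `P` is meaningful for relative comparisons between entries rather than as a covariance in the original units.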

    Designing Brains for Pain: Human to Mollusc
